Algorithm scours health records for lung cancer risk

Siru Liu, PhD, Adam Wright, PhD, and colleagues at Vanderbilt University Medical Center have developed a computer algorithm that scans electronic health records, or EHRs, to identify patients who meet criteria for lung cancer screening. As reported in The International Journal of Medical Informatics, the algorithm uses natural language processing, or NLP, to vastly outperform a scan relying strictly on smoking information entered in designated fields.

Lung cancer is the leading cause of cancer deaths in the U.S. Screening with annual low-dose CT scans can catch cases early, and health care payers answer to federal guidelines that currently recommend screening for smokers ages 50 to 80 with a 20-pack-year history who still smoke or who quit within the last 15 years. Screening could lower mortality by 20%, studies have found.
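The eligibility rule described above can be written as a simple check. What follows is a minimal sketch, not the study's code: the `SmokingHistory` record and the `meets_screening_criteria` function are hypothetical names, and treating the age and 15-year quit boundaries as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SmokingHistory:
    """Hypothetical record holding the facts the guideline depends on."""
    age: int
    pack_years: float
    currently_smokes: bool
    years_since_quit: Optional[float] = None  # None if still smoking or unknown


def meets_screening_criteria(history: SmokingHistory) -> bool:
    """Apply the criteria as stated in the article: ages 50 to 80,
    a 20-pack-year history, and current smoking or a quit date
    within the last 15 years (boundaries assumed inclusive)."""
    if not 50 <= history.age <= 80:
        return False
    if history.pack_years < 20:
        return False
    if history.currently_smokes:
        return True
    # Former smokers qualify only when the quit date is known and recent.
    return history.years_since_quit is not None and history.years_since_quit <= 15


if __name__ == "__main__":
    print(meets_screening_criteria(SmokingHistory(62, 30, True)))
    print(meets_screening_criteria(SmokingHistory(62, 30, False, years_since_quit=20)))
```

A patient with a missing pack-year value or quit date cannot pass a check like this, which is why gaps in the structured fields matter so much.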

In 2021 only 5.8% of those who met the risk criteria were screened, according to the American Lung Association. And studies have found that the screening rate among Blacks is less than half that among whites.

EHRs include fields expressly for documenting smoking status, packs per day, pack years and quit dates. This structured data is intended in part to support automated EHR alerts aimed at improving screening rates. But very often patients and clinicians don’t fill in these fields. Instead, this key smoking information frequently appears as unstructured data in clinical notes, where it is more easily overlooked by care teams and missed entirely by clinical alert systems.

To demonstrate their algorithm, the team used records of 102,475 VUMC primary care patients ages 50 to 80, 40% of whom had a history of smoking reflected in structured data.

The algorithm first draws on the system’s structured data on pack years and quit dates, which were missing for 43% of patients with a documented history of smoking. To extract the relevant information from clinical notes, the algorithm uses NLP, with a demonstrated accuracy of 96%.
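The structured-first, notes-second order can be illustrated with a short sketch. This is an assumption-laden stand-in, not the team's method: a simple regular expression takes the place of the NLP model, the function names are hypothetical, and taking the largest value mentioned in the notes is an arbitrary choice made for illustration.

```python
import re
from typing import List, Optional

# Hypothetical pattern standing in for the NLP step; it matches phrases
# such as "30 pack-year", "22.5 pack years" or "40-pack-year".
PACK_YEAR_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*-?\s*pack[\s-]*years?", re.IGNORECASE
)


def pack_years_from_notes(notes: List[str]) -> Optional[float]:
    """Return the largest pack-year value mentioned in the notes, or None."""
    values = [
        float(match.group(1))
        for note in notes
        for match in PACK_YEAR_PATTERN.finditer(note)
    ]
    return max(values) if values else None


def resolve_pack_years(structured_value: Optional[float],
                       notes: List[str]) -> Optional[float]:
    """Prefer the structured field; fall back to note text only when it is empty."""
    if structured_value is not None:
        return structured_value
    return pack_years_from_notes(notes)


if __name__ == "__main__":
    notes = ["Former smoker, 30 pack-year history, quit in 2015."]
    print(resolve_pack_years(None, notes))
```

A real NLP system also has to handle negation, misspellings and phrasing such as "one pack a day for 30 years," none of which this pattern attempts.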

The team compared their hybrid algorithm, operating on three years’ worth of records, to a baseline scan using the most recent structured EHR data. The algorithm found 10,231 patients eligible for screening, 1.74 times as many as the baseline’s 5,887. And compared to the baseline, the algorithm found 2.2 times as many eligible Black patients.
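The overall comparison can be checked with a few lines of arithmetic. The two counts come from the figures reported above; the variable names are illustrative only.

```python
hybrid_eligible = 10_231    # eligible patients found by the hybrid algorithm
baseline_eligible = 5_887   # eligible patients found by the structured-data baseline

ratio = hybrid_eligible / baseline_eligible
additional = hybrid_eligible - baseline_eligible
percent_increase = 100 * additional / baseline_eligible

print(f"ratio: {ratio:.2f}")
print(f"additional patients: {additional}")
print(f"increase over baseline: {percent_increase:.1f}%")
```

In other words, the hybrid approach identified 4,344 eligible patients the baseline missed, an increase of roughly 74%.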

“Our results show that electronic clinical decision support for lung cancer screening stands to be greatly improved, both overall and with special regard to health care equity for Blacks, by strategies that include unstructured EHR data,” said Liu, assistant professor of Biomedical Informatics, the paper’s first author.

Wright, the senior author, is professor of Biomedical Informatics and director of the Vanderbilt Clinical Informatics Center. Others from Vanderbilt on the study include Allison McCoy, PhD, Melinda Aldrich, PhD, Kim Sandler, MD, Thomas Reese, PhD, PharmD, Bryan Steitz, PhD, and Elise Russo, MPH, PMP. They were joined by two researchers from the University of Florida, where the NLP algorithm was originally developed. The study was supported by the National Institutes of Health (AG062499, LM014097).